# Chapter 10 Instruction Scheduling





Can we generate code to leverage modern CPUs for better instruction-level parallelization?





- Every modern high-performance processor can execute several operations in a single clock cycle.
- Billion-Dollar Question: how fast can a program be run on a processor with instruction-level parallelism?





- Every modern high-performance processor can execute several operations in a single clock cycle.
- Billion-Dollar Question: how fast can a program be run on a processor with instruction-level parallelism?

The potential parallelism in the program

The available parallelism on the processor

Our ability to extract parallelism from the code

Our ability to find the best scheduling





- Every modern high-performance processor can execute several operations in a single clock cycle.
- Billion-Dollar Question: how fast can a program be run on a processor with instruction-level parallelism?

The potential parallelism in the program

The available parallelism on the processor

Our ability to extract parallelism from the code

Our ability to find the best scheduling



Instruction pipelines

```
IF
       ID
                       \operatorname{IF}
        EX
                       ID
                                      \operatorname{IF}
        MEM
                       \mathbf{E}\mathbf{X}
                                      ID
                                                     \operatorname{IF}
                       MEM
                                      \mathbf{E}\mathbf{X}
                                                      ID
        WB
                                                                     \operatorname{IF}
6.
                                      MEM
                       WB
                                                      EX
                                                                     ID
                                       WB
                                                      MEM
                                                                     \mathbf{E}\mathbf{X}
8.
                                                                     MEM
                                                      WB
                                                                     WB
9.
```



Instruction pipelines

```
IF
       ID
                     \operatorname{IF}
       EX
                     ID
                                   \operatorname{IF}
       MEM
                     EX
                                   ID
                                                \operatorname{IF}
5.
       WB
                     MEM
                                   \mathbf{E}\mathbf{X}
                                                 ID
                                                              \operatorname{IF}
6.
                     WB
                                   MEM
                                                 EX
                                                              ID
                                   WB
                                                 MEM
                                                               \mathbf{E}\mathbf{X}
8.
                                                               MEM
                                                 WB
                                                               WB
9.
```



Instruction pipelines

| 1. | $\operatorname{IF}$    |                        |                         |                        |                        |
|----|------------------------|------------------------|-------------------------|------------------------|------------------------|
| 2. | ID                     | $\mathbf{IF}$          |                         |                        |                        |
| 3. | $\mathbf{E}\mathbf{X}$ | ID                     | $\operatorname{IF}$     |                        |                        |
| 4. | MEM                    | $\mathbf{E}\mathbf{X}$ | ID                      | $\operatorname{IF}$    |                        |
| 5. | WB                     | MEM                    | EX                      | ID                     | IF                     |
| 6. |                        | WB                     | $\overline{\text{MEM}}$ | $\mathbf{E}\mathbf{X}$ | ID                     |
| 7. |                        |                        | WB                      | MEM                    | $\mathbf{E}\mathbf{X}$ |
| 8. |                        |                        |                         | WB                     | MEM                    |
| 9. |                        |                        |                         |                        | WB                     |
|    |                        |                        |                         |                        |                        |



• Instruction pipelines Note: an instruction may not contain all five phases

| 1. | IF                     |                        |                         |                        |                        |
|----|------------------------|------------------------|-------------------------|------------------------|------------------------|
| 2. | ID                     | $\operatorname{IF}$    |                         |                        |                        |
| 3. | $\mathbf{E}\mathbf{X}$ | ID                     | $\operatorname{IF}$     |                        |                        |
| 4. | MEM                    | $\mathbf{E}\mathbf{X}$ | ID                      | $\operatorname{IF}$    |                        |
| 5. | WB                     | MEM                    | EX                      | ID                     | IF                     |
| 6. |                        | WB                     | $\overline{\text{MEM}}$ | $\mathbf{E}\mathbf{X}$ | ID                     |
| 7. |                        |                        | WB                      | MEM                    | $\mathbf{E}\mathbf{X}$ |
| 8. |                        |                        |                         | WB                     | MEM                    |
| 9. |                        |                        |                         |                        | WB                     |
|    |                        |                        |                         |                        |                        |



- Machine types
  - VLIW (Very long instruction machine) machine
    - Instruction words are longer, encode the operations to be issued in a single clock
    - Compilers decide which operations are scheduled in parallel by encoding such info into the instructions



- Machine types
  - VLIW (Very long instruction machine) machine
  - Superscalar machine
    - Regular instruction set with an ordinary sequential-execution semantics
    - Automatically detect dependencies and issue them in parallel

```
1. IF
2. ID IF
3. EX ID IF
4. MEM EX ID IF
5. WB MEM EX ID IF
6. WB MEM EX ID
7. WB MEM EX
8. WB MEM
9. WB WB
```



- Machine types
  - VLIW (Very long instruction machine) machine
  - Superscalar machine
  - "Out-of-order" machine
    - Automatically issue as many instructions as possible
    - Stop them until all operands are ready



- Machine types
  - VLIW (Very long instruction machine) machine
  - Superscalar machine
  - "Out-of-order" machine
- Why compilers matter?



- Machine types
  - VLIW (Very long instruction machine) machine
  - Superscalar machine
  - "Out-of-order" machine
- Why compilers matter?
  - Hardware only can execute instructions that have been fetched



- Machine types
  - VLIW (Very long instruction machine) machine
  - Superscalar machine
  - "Out-of-order" machine

- Why compilers matter?
  - Hardware only can execute instructions that have been fetched
  - Hardware has limited space to buffer operations that must be stalled



- Machine types
  - VLIW (Very long instruction machine) machine
  - Superscalar machine
  - "Out-of-order" machine

- Why compilers matter?
  - Hardware only can execute instructions that have been fetched
  - Hardware has limited space to buffer operations that must be stalled
  - Compilers can place independent operations close together to allow better hardware utilization





- Every modern high-performance processor can execute several operations in a single clock cycle.
- Billion-Dollar Question: how fast can a program be run on a processor with instruction-level parallelism?

The potential parallelism in the program

The available parallelism on the processor

Our ability to extract parallelism from the code

Our ability to find the best scheduling



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |



- Example: Superscalar machin
  - LD/ST 3 cycles; MUL 2 cycles; Al

| LD  | R1, | a   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

a = 2 \* a \* b \* c



| 1  | LD | R1, | а | •••••    |
|----|----|-----|---|----------|
| 2  |    |     |   |          |
| 3  |    |     |   | R1 ready |
| 4  |    |     |   |          |
| 5  |    |     |   |          |
| 6  |    |     |   |          |
| 7  |    |     |   |          |
| 8  |    |     |   |          |
| 9  |    |     |   |          |
| 10 |    |     |   |          |
| 11 |    |     |   |          |
| 12 |    |     |   |          |
| 13 |    |     |   |          |
| 14 |    |     |   |          |
| 15 |    |     |   |          |
| 16 |    |     |   |          |



- Example: Superscalar machin
  - LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

schedule

| LD  | R1, | a   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

| S   | ST  |   |   | a, |   |   |  |
|-----|-----|---|---|----|---|---|--|
| a = | 2 * | а | * | b  | * | С |  |

| 1  | LD  | R1, | а   |    |          |
|----|-----|-----|-----|----|----------|
| 2  |     |     |     |    |          |
| 3  |     |     |     |    | R1 ready |
| 4  | ADD | R1, | R1, | R1 | R1 ready |
| 5  |     |     |     |    |          |
| 6  |     |     |     |    |          |
| 7  |     |     |     |    |          |
| 8  |     |     |     |    |          |
| 9  |     |     |     |    |          |
| 10 |     |     |     |    |          |
| 11 |     |     |     |    |          |
| 12 |     |     |     |    |          |
| 13 |     |     |     |    |          |
| 14 |     |     |     |    |          |
| 15 |     |     |     |    |          |
| 16 |     |     |     |    |          |



#### Example: Superscalar machiq

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

a = 2 \* a \* b \* c



| 1   | LD  | R1, | a   |    | •••••       |
|-----|-----|-----|-----|----|-------------|
| 2   |     |     |     |    |             |
| 3   |     |     |     |    | R1 ready    |
| 4   | ADD | R1, | R1, | R1 | R1 ready    |
| - 5 | LD  | R2, | b   |    | R1 ready    |
| 6   |     |     |     |    | R1 ready    |
| 7   |     |     |     |    | R1/R2 ready |
| 8   |     |     |     |    |             |
| 9   |     |     |     |    |             |
| 10  |     |     |     |    |             |
| 11  |     |     |     |    |             |
| 12  |     |     |     |    |             |
| 13  |     |     |     |    |             |
| 14  |     |     |     |    |             |
| 15  |     |     |     |    |             |
| 16  |     |     |     |    |             |



R1 ready

R1 ready

## Instruction Sche

#### • Example: Superscalar maching 4

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

| <ul> <li>LD/ST 3 cycles; MUL 2 cycles;</li> </ul> | <u>5</u> | LD F  | R2, | b   |    | R1 ready    |
|---------------------------------------------------|----------|-------|-----|-----|----|-------------|
|                                                   | 6        |       |     |     |    | R1 ready    |
|                                                   | 7        |       |     |     |    | R1/R2 ready |
| LD R1, a                                          | 8        | MUL F | R1, | R1, | R2 | R2 ready    |
| ADD R1, R1, R1                                    | 9        |       |     |     |    |             |
| LD R2, b                                          | 10       |       |     |     |    |             |
| MUL R1, R1, R2                                    | 11       |       |     |     |    |             |
| LD R2, c                                          | 12       |       |     |     |    |             |
| MUL R1, R1, R2                                    | 13       |       |     |     |    |             |
| ST a, R1                                          |          |       |     |     |    |             |
|                                                   | 14       |       |     |     |    |             |
| a = 2 * a * b * c                                 | 15       |       |     |     |    |             |
|                                                   | 16       |       |     |     |    |             |
|                                                   |          |       |     |     |    |             |

LD

R1, a

ADD R1, R1, R1



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

a = 2 \* a \* b \* c



| 1  | LD  | R1, | а   |    | •••••       |
|----|-----|-----|-----|----|-------------|
| 2  |     |     |     |    |             |
| 3  |     |     |     |    | R1 ready    |
| 4  | ADD | R1, | R1, | R1 | R1 ready    |
| 5  | LD  | R2, | b   |    | R1 ready    |
| 6  |     |     |     |    | R1 ready    |
| 7  |     |     |     |    | R1/R2 ready |
| 8  | MUL | R1, | R1, | R2 | R2 ready    |
| 9  |     |     |     |    | R1/R2 ready |
| 10 |     |     |     |    |             |
| 11 |     |     |     |    |             |
| 12 |     |     |     |    |             |
| 13 |     |     |     |    |             |
| 14 |     |     |     |    |             |
| 15 |     |     |     |    |             |
| 16 |     |     |     |    |             |



- Example: Superscalar machine
  - LD/ST 3 cycles; MUL 2 cycles; Al

schedule

| LD  | R1, | а   |    |   |
|-----|-----|-----|----|---|
| ADD | R1, | R1, | R1 |   |
| LD  | R2, | b   |    |   |
| MUL | R1, | R1, | R2 |   |
| LD  | R2, | С   |    | 1 |
| MUL | R1, | R1, | R2 |   |
| ST  | a,  | R1  |    |   |

a = 2 \* a \* b \* c

| 1        |             | LD  | R1, | а   |    |             |
|----------|-------------|-----|-----|-----|----|-------------|
| 2        |             |     |     |     |    |             |
| 3        |             |     |     |     |    | R1 ready    |
| 4        |             | ADD | R1, | R1, | R1 | R1 ready    |
| <b>5</b> |             | LD  | R2, | b   |    | R1 ready    |
| 6        |             |     |     |     |    | R1 ready    |
| 7        |             |     |     |     |    | R1/R2 ready |
| 8        |             | MUL | R1, | R1, | R2 | R2 ready    |
| 9        |             |     |     |     |    | R1/R2 ready |
| 10 -     | <b>&gt;</b> |     |     |     |    |             |
| 11       |             |     |     |     |    |             |
| 12       |             |     |     |     |    |             |
| 13       |             |     |     |     |    |             |
| 14       |             |     |     |     |    |             |
| 15       |             |     |     |     |    |             |
| 16       |             |     |     |     |    |             |



- Example: Superscalar machin
  - LD/ST 3 cycles; MUL 2 cycles; Al

schedule

| LD  | R1, | а   |    |    |
|-----|-----|-----|----|----|
| ADD | R1, | R1, | R1 |    |
| LD  | R2, | b   |    |    |
| MUL | R1, | R1, | R2 |    |
| LD  | R2, | С   |    | 11 |
| MUL | R1, | R1, | R2 |    |
| ST  | a,  | R1  |    |    |

| a = | 2 | * | а | * | b | * | С |
|-----|---|---|---|---|---|---|---|
|     |   |   |   |   |   |   |   |

| 1    |    | LD  | R1, | a   |    |             |
|------|----|-----|-----|-----|----|-------------|
| 2    |    |     |     |     |    |             |
| 3    |    |     |     |     |    | R1 ready    |
| 4    |    | ADD | R1, | R1, | R1 | R1 ready    |
| 5    |    | LD  | R2, | b   |    | R1 ready    |
| 6    |    |     |     |     |    | R1 ready    |
| 7    |    |     |     |     |    | R1/R2 ready |
| 8    |    | MUL | R1, | R1, | R2 | R2 ready    |
| 9    | >  |     |     |     |    | R1/R2 ready |
| 10 - | -> |     |     |     |    |             |
| 11   |    |     |     |     |    |             |
| 12   |    |     |     |     |    |             |
| 13   |    |     |     |     |    |             |
| 14   |    |     |     |     |    |             |
| 15   |    |     |     |     |    |             |
| 16   |    |     | _   | _   |    |             |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

<u>schedule</u>

| 1  | LD  | R1, | a   |    | •••••       |
|----|-----|-----|-----|----|-------------|
| 2  |     |     |     |    |             |
| 3  |     |     |     |    | R1 ready    |
| 4  | ADD | R1, | R1, | R1 | R1 ready    |
| 5  | LD  | R2, | b   |    | R1 ready    |
| 6  |     |     |     |    | R1 ready    |
| 7  |     |     |     |    | R1/R2 ready |
| 8  | MUL | R1, | R1, | R2 | R2 ready    |
| 9  | LD  | R2, | С   |    | R1 ready    |
| 10 |     |     |     |    | R1 ready    |
| 11 |     |     |     |    | R1/R2 ready |
| 12 |     |     |     |    |             |
| 13 |     |     |     |    |             |
| 14 |     |     |     |    |             |
| 15 |     |     |     |    |             |
| 16 |     |     |     |    |             |
|    |     |     |     |    |             |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |

<u>scnedule</u>

| 1  | LD  | R1, | a   |    | •••••       |
|----|-----|-----|-----|----|-------------|
| 2  |     |     |     |    |             |
| 3  |     |     |     |    | R1 ready    |
| 4  | ADD | R1, | R1, | R1 | R1 ready    |
| 5  | LD  | R2, | b   |    | R1 ready    |
| 6  |     |     |     |    | R1 ready    |
| 7  |     |     |     |    | R1/R2 ready |
| 8  | MUL | R1, | R1, | R2 | R2 ready    |
| 9  | LD  | R2, | С   |    | R1 ready    |
| 10 |     |     |     |    | R1 ready    |
| 11 |     |     |     |    | R1/R2 ready |
| 12 | MUL | R1, | R1, | R2 | R2 ready    |
| 13 |     |     |     |    | R1/R2 ready |
| 14 |     |     |     |    |             |
| 15 |     |     |     |    |             |
| 16 |     |     |     |    |             |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |

<u>schedule</u>

| 1  | LD  | R1, | a   |    |             |
|----|-----|-----|-----|----|-------------|
| 2  |     |     |     |    |             |
| 3  |     |     |     |    | R1 ready    |
| 4  | ADD | R1, | R1, | R1 | R1 ready    |
| 5  | LD  | R2, | b   |    | R1 ready    |
| 6  |     |     |     |    | R1 ready    |
| 7  |     |     |     |    | R1/R2 ready |
| 8  | MUL | R1, | R1, | R2 | R2 ready    |
| 9  | LD  | R2, | С   |    | R1 ready    |
| 10 |     |     |     |    | R1 ready    |
| 11 |     |     |     |    | R1/R2 ready |
| 12 | MUL | R1, | R1, | R2 | R2 ready    |
| 13 |     |     |     |    | R1/R2 ready |
| 14 | ST  | а,  | R1  |    | •••••       |
| 15 |     |     |     |    | •••••       |
| 16 |     |     |     |    | Done!       |



R1 ready

R1 ready

R1 ready

## Instruction Sche

#### Example: Superscalar machil<sup>4</sup>

• LD/ST 3 cycles; MUL 2 cycles; AL 5

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

a

|           |         |     | ·  | •               | 6  |     |     |     |    | R1 ready    |
|-----------|---------|-----|----|-----------------|----|-----|-----|-----|----|-------------|
|           |         |     |    | 1               | 7  |     |     |     |    | R1/R2 ready |
| LD        | R1,     | a   |    |                 | 8  | MUL | R1, | R1, | R2 | R2 ready    |
| ADD       | R1,     | R1, | R1 |                 | 9  | LD  | R2, | С   |    | R1 ready    |
| LD        | R2,     | b   |    | <u>schedule</u> | 10 |     |     |     |    | R1 ready    |
| MUL       | R1,     | R1, | R2 |                 | 11 |     |     |     |    | R1/R2 ready |
| LD        | R2,     | С   |    |                 | 12 | MUL | R1, | R1, | R2 | R2 ready    |
| MUL       | R1,     | R1, | R2 |                 | 13 |     |     |     |    | R1/R2 ready |
| ST        | а,      | R1  |    |                 | 14 | ST  | а,  | R1  |    |             |
| a = 2 * a | a * b * | С   |    |                 | 15 |     |     |     |    | •••••       |
|           |         |     |    |                 | 16 |     |     |     |    | Done!       |
|           |         |     |    |                 |    |     |     |     |    |             |

3

LD

ADD

LD

R1, a

R1, R1,

R2, b



Can a compiler generate better code?



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |



- Example: Superscalar machine (1 functional unit)
  - LD/ST 3 cycles; MUL 2 cycles; ADD 1 cycle

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |

| LD  | R1,         | а   |           |  |
|-----|-------------|-----|-----------|--|
| LD  | R2,         | b   |           |  |
| LD  | <u>R3</u> , | С   |           |  |
| ADD | R1,         | R1, | R1        |  |
| MUL | R1,         | R1, | R2        |  |
| MUL | R1,         | R1, | <u>R3</u> |  |
| ST  | a,          | R1  |           |  |

$$a = 2 * a * b * c$$



- Example: Superscalar machin
  - LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | a   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
| MUL | R1, | R1, | R2 |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |

a = 2 \* a \* b \* c



| 1  | LD | R1, | a | •••••    |
|----|----|-----|---|----------|
| 2  |    |     |   |          |
| 3  |    |     |   | R1 ready |
| 4  |    |     |   |          |
| 5  |    |     |   |          |
| 6  |    |     |   |          |
| 7  |    |     |   |          |
| 8  |    |     |   |          |
| 9  |    |     |   |          |
| 10 |    |     |   |          |
| 11 |    |     |   |          |
| 12 |    |     |   |          |
| 13 |    |     |   |          |
| 14 |    |     |   |          |
| 15 |    |     |   |          |
| 16 |    |     |   |          |



- Example: Superscalar machin
  - LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
| MUL | R1, | R1, | R2 |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |

a = 2 \* a \* b \* c



| 1  | LD | R1, | а | •••••    |
|----|----|-----|---|----------|
| 2  |    |     |   |          |
| 3  |    |     |   | R1 ready |
| 4  |    |     |   |          |
| 5  |    |     |   |          |
| 6  |    |     |   |          |
| 7  |    |     |   |          |
| 8  |    |     |   |          |
| 9  |    |     |   |          |
| 10 |    |     |   |          |
| 11 |    |     |   |          |
| 12 |    |     |   |          |
| 13 |    |     |   |          |
| 14 |    |     |   |          |
| 15 |    |     |   |          |
| 16 |    |     |   |          |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |  |
|-----|-----|-----|----|--|
| LD  | R2, | b   |    |  |
| LD  | R3, | С   |    |  |
| ADD | R1, | R1, | R1 |  |
| MUL | R1, | R1, | R2 |  |
| MUL | R1, | R1, | R3 |  |
| ST  | a,  | R1  |    |  |



| 1  | LD | R1, | a |             |
|----|----|-----|---|-------------|
| 2  | LD | R2, | b |             |
| 3  |    |     |   | R1 ready    |
| 4  |    |     |   | R1/R2 ready |
| 5  |    |     |   |             |
| 6  |    |     |   |             |
| 7  |    |     |   |             |
| 8  |    |     |   |             |
| 9  |    |     |   |             |
| 10 |    |     |   |             |
| 11 |    |     |   |             |
| 12 |    |     |   |             |
| 13 |    |     |   |             |
| 14 |    |     |   |             |
| 15 |    |     |   |             |
| 16 |    |     |   |             |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |   |
|-----|-----|-----|----|---|
| LD  | R2, | b   |    |   |
| LD  | R3, | С   |    |   |
| ADD | R1, | R1, | R1 |   |
| MUL | R1, | R1, | R2 |   |
| MUL | R1, | R1, | R3 |   |
| ST  | a,  | R1  |    | _ |



| 1  | LD | R1, | a |                |
|----|----|-----|---|----------------|
| 2  | LD | R2, | b |                |
| 3  | LD | R3, | С | R1 ready       |
| 4  |    |     |   | R1/R2 ready    |
| 5  |    |     |   | R1/R2/R3 ready |
| 6  |    |     |   |                |
| 7  |    |     |   |                |
| 8  |    |     |   |                |
| 9  |    |     |   |                |
| 10 |    |     |   |                |
| 11 |    |     |   |                |
| 12 |    |     |   |                |
| 13 |    |     |   |                |
| 14 |    |     |   |                |
| 15 |    |     |   |                |
| 16 |    |     |   |                |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; Al

| LD  | R1, | а   |    |  |
|-----|-----|-----|----|--|
| LD  | R2, | b   |    |  |
| LD  | R3, | С   |    |  |
| ADD | R1, | R1, | R1 |  |
| MUL | R1, | R1, | R2 |  |
| MUL | R1, | R1, | R3 |  |
| ST  | a,  | R1  |    |  |



| 1  | LD  | R1, | а   |    |                |
|----|-----|-----|-----|----|----------------|
| 2  | LD  | R2, | b   |    |                |
| 3  | LD  | R3, | С   |    | R1 ready       |
| 4  | ADD | R1, | R1, | R1 | R1/R2 ready    |
| 5  |     |     |     |    | R1/R2/R3 ready |
| 6  |     |     |     |    |                |
| 7  |     |     |     |    |                |
| 8  |     |     |     |    |                |
| 9  |     |     |     |    |                |
| 10 |     |     |     |    |                |
| 11 |     |     |     |    |                |
| 12 |     |     |     |    |                |
| 13 |     |     |     |    |                |
| 14 |     |     |     |    |                |
| 15 |     |     |     |    |                |
| 16 |     |     |     |    |                |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |  |
|-----|-----|-----|----|--|
| LD  | R2, | b   |    |  |
| LD  | R3, | С   |    |  |
| ADD | R1, | R1, | R1 |  |
| MUL | R1, | R1, | R2 |  |
| MUL | R1, | R1, | R3 |  |
| ST  | a,  | R1  |    |  |

a = 2 \* a \* b \* c



| 1       LD       R1, a          2       LD       R2, b          3       LD       R3, c       R1 ready         4       ADD       R1, R1, R1       R1/R2 ready         5       MUL       R1, R1, R2       R2/R3 ready         6       R1/R2/R3 ready         7          8          9          10          11          12          13          14          15          16 |    |     |     |     |    |                |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|-----|-----|-----|----|----------------|
| 3 LD R3, c R1 ready 4 ADD R1, R1, R1 R1/R2 ready 5 MUL R1, R1, R2 R2/R3 ready 6 R1/R2/R3 ready 7 8 9 10 11 12 13 14 15                                                                                                                                                                                                                                                 | 1  | LD  | R1, | a   |    |                |
| 4 ADD R1, R1, R1 R1/R2 ready 5 MUL R1, R1, R2 R2/R3 ready 7 8 9 10 11 12 13 14 15                                                                                                                                                                                                                                                                                      | 2  | LD  | R2, | b   |    |                |
| 5       MUL R1, R1, R2       R2/R3 ready         6       R1/R2/R3 ready         7       8         9       9         10       11         12       13         14       15                                                                                                                                                                                                | 3  | LD  | R3, | С   |    | R1 ready       |
| 6       R1/R2/R3 ready         7       8         9       9         10       9         11       11         12       12         13       14         15       15                                                                                                                                                                                                          | 4  | ADD | R1, | R1, | R1 | R1/R2 ready    |
| 7         8         9         10         11         12         13         14         15                                                                                                                                                                                                                                                                                | 5  | MUL | R1, | R1, | R2 | R2/R3 ready    |
| 8         9         10         11         12         13         14         15                                                                                                                                                                                                                                                                                          | 6  |     |     |     |    | R1/R2/R3 ready |
| 9                                                                                                                                                                                                                                                                                                                                                                      | 7  |     |     |     |    |                |
| 10         11         12         13         14         15                                                                                                                                                                                                                                                                                                              | 8  |     |     |     |    |                |
| 11         12         13         14         15                                                                                                                                                                                                                                                                                                                         | 9  |     |     |     |    |                |
| 12         13         14         15                                                                                                                                                                                                                                                                                                                                    | 10 |     |     |     |    |                |
| 13         14         15                                                                                                                                                                                                                                                                                                                                               | 11 |     |     |     |    |                |
| 14       15                                                                                                                                                                                                                                                                                                                                                            | 12 |     |     |     |    |                |
| 15                                                                                                                                                                                                                                                                                                                                                                     | 13 |     |     |     |    |                |
|                                                                                                                                                                                                                                                                                                                                                                        | 14 |     |     |     |    |                |
| 16                                                                                                                                                                                                                                                                                                                                                                     | 15 |     |     |     |    |                |
|                                                                                                                                                                                                                                                                                                                                                                        | 16 |     |     |     |    |                |



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
| MUL | R1, | R1, | R2 |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |

<u>schedule</u>

| LD  | R2,              | b                      |                                  |                                          |
|-----|------------------|------------------------|----------------------------------|------------------------------------------|
| LD  | R3,              | С                      |                                  | R1 ready                                 |
| ADD | R1,              | R1,                    | R1                               | R1/R2 ready                              |
| MUL | R1,              | R1,                    | R2                               | R2/R3 ready                              |
|     |                  |                        |                                  | R1/R2/R3 ready                           |
| MUL | R1,              | R1,                    | R3                               | R2/R3 ready                              |
|     |                  |                        |                                  | R1/R2/R3 ready                           |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     |                  |                        |                                  |                                          |
|     | LD<br>ADD<br>MUL | LD R3, ADD R1, MUL R1, | LD R3, c ADD R1, R1, MUL R1, R1, | LD R3, c  ADD R1, R1, R1  MUL R1, R1, R2 |

R1,

LD



#### Example: Superscalar machin

LD/ST 3 cycles; MUL 2 cycles; AL

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
| MUL | R1, | R1, | R2 |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |

<u>schedule</u>

| 1  | LD  | R1, | a   |    | •••••          |
|----|-----|-----|-----|----|----------------|
| 2  | LD  | R2, | b   |    |                |
| 3  | LD  | R3, | С   |    | R1 ready       |
| 4  | ADD | R1, | R1, | R1 | R1/R2 ready    |
| 5  | MUL | R1, | R1, | R2 | R2/R3 ready    |
| 6  |     |     |     |    | R1/R2/R3 ready |
| 7  | MUL | R1, | R1, | R3 | R2/R3 ready    |
| 8  |     |     |     |    | R1/R2/R3 ready |
| 9  | ST  | a,  | R1  |    |                |
| 10 |     |     |     |    |                |
| 11 |     |     |     |    | Done!          |
| 12 |     |     |     |    |                |
| 13 |     |     |     |    |                |
| 14 |     |     |     |    |                |
| 15 |     |     |     |    |                |
| 16 |     |     |     |    |                |
|    |     |     |     |    |                |



#### Example: Superscalar machi

LD/ST 3 cycles; MUL 2 cycles; Al

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
| MUL | R1, | R1, | R2 |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |

<u>schedule</u>

|        | 2  | LD  | R2, | b   |          |                |
|--------|----|-----|-----|-----|----------|----------------|
|        | 3  | LD  | R3, | С   |          | R1 ready       |
|        | 4  | ADD | R1, | R1, | R1       | R1/R2 ready    |
| Ĭ<br>Į | 5  | MUL | R1, | R1, | R2       | R2/R3 ready    |
|        | 6  |     |     |     |          | R1/R2/R3 ready |
|        | 7  | MUL | R1, | R1, | R3       | R2/R3 ready    |
|        | 8  |     |     |     |          | R1/R2/R3 ready |
|        | 9  | ST  | a,  | R1  |          |                |
|        | 10 |     |     |     |          |                |
|        | 11 |     |     |     |          | Done!          |
|        | 12 |     |     |     |          |                |
|        | 13 |     |     |     |          |                |
|        | 14 |     |     |     | $>\!\!<$ |                |
|        | 15 |     |     |     |          |                |
|        | 16 |     |     |     |          |                |

R1, a

| а | = | 2 | * | а | * | b | * | C |
|---|---|---|---|---|---|---|---|---|
|---|---|---|---|---|---|---|---|---|





Can we generate code to leverage modern CPUs for better instruction-level parallelization? Yes! Possible!

| LD  | R1, | a   |    |  |
|-----|-----|-----|----|--|
| ADD | R1, | R1, | R1 |  |
| LD  | R2, | b   |    |  |
| MUL | R1, | R1, | R2 |  |
| LD  | R2, | С   |    |  |
| MUL | R1, | R1, | R2 |  |
| ST  | а,  | R1  |    |  |

$$a = 2 * a * b * c$$





Can we generate code to leverage modern CPUs for better instruction-level parallelization? Yes! Possible!

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |







Can we generate code to leverage modern CPUs for better instruction-level parallelization? Yes! Possible!

- 1. Scheduling constraints
- 2. Basic block scheduling → 3. Global scheduling
- 4. Software pipelining (optional)



## **PART I: Scheduling Constraints**



# Performance is Order Dependent

| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |



# Performance is Order Dependent

| LD  | R1, | a   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | a,  | R1  |    |

| LD  | R1, | a   |    |             |
|-----|-----|-----|----|-------------|
| LD  | R2, | b   |    |             |
| LD  | R3, | С   |    |             |
| ADD | R1, | R1, | R1 |             |
| MUL | R1, | R1, | R2 |             |
| MUL | R1, | R1, | R3 |             |
| ST  | а,  | R1  |    |             |
|     |     |     |    | <b>-/</b> / |



True Dependence (Read-After-Write Dependence)

|   | LD  | R1, | a   |    |
|---|-----|-----|-----|----|
|   | ADD | R1, | R1, | R1 |
|   | LD  | R2, | b   |    |
| 1 | MUL | R1, | R1, | R2 |
|   | LD  | R2, | С   |    |
|   | MUL | R1, | R1, | R2 |
|   | ST  | a,  | R1  |    |





- Storage (Fake) Dependence --- Rename to Remove
  - Antidependence & Output dependence



- Storage (Fake) Dependence --- Rename to Remove
  - Antidependence (Write-After-Read) & Output dependence

|   | LD  | R1,         | a   |           |
|---|-----|-------------|-----|-----------|
|   | ADD | R1,         | R1, | R1        |
|   | LD  | R2,         | b   |           |
|   | MUL | R1,         | R1, | <u>R2</u> |
| 1 | LD  | <u>R2</u> , | С   |           |
|   | MUL | R1,         | R1, | R2        |
|   | ST  | a,          | R1  |           |

|     |             |     |           |   | _ |
|-----|-------------|-----|-----------|---|---|
| LD  | R1,         | a   |           |   |   |
| LD  | R2,         | b   |           |   |   |
| LD  | <u>R3</u> , | С   |           |   |   |
| ADD | R1,         | R1, | R1        |   |   |
| MUL | R1,         | R1, | <u>R2</u> |   |   |
| MUL | R1,         | R1, | R3        |   |   |
| ST  | a,          | R1  |           |   |   |
|     |             |     |           | _ |   |



- Storage (Fake) Dependence --- Rename to Remove
  - Antidependence & Output dependence (Write-After-Write)

| LD  | R1,               | a                                         |                                                           |
|-----|-------------------|-------------------------------------------|-----------------------------------------------------------|
| ADD | R1,               | R1,                                       | R1                                                        |
| LD  | <u>R2</u> ,       | b                                         |                                                           |
| MUL | R1,               | R1,                                       | R2                                                        |
| LD  | <u>R2</u> ,       | С                                         |                                                           |
| MUL | R1,               | R1,                                       | R2                                                        |
| ST  | a,                | R1                                        |                                                           |
|     | ADD LD MUL LD MUL | ADD R1,  LD R2,  MUL R1,  LD R2,  MUL R1, | ADD R1, R1,  LD R2, b  MUL R1, R1,  LD R2, c  MUL R1, R1, |

|   |     |             |     |    | _   |
|---|-----|-------------|-----|----|-----|
|   | LD  | R1,         | a   |    |     |
| - | LD  | <u>R2</u> , | b   |    |     |
|   | LD  | <u>R3</u> , | С   |    |     |
|   | ADD | R1,         | R1, | R1 |     |
|   | MUL | R1,         | R1, | R2 |     |
|   | MUL | R1,         | R1, | R3 |     |
|   | ST  | а,          | R1  |    |     |
|   |     |             |     |    | 7// |



Independent instructions can be scheduled in parallel



Independent instructions can be scheduled in parallel

We can remove fake dependence by renaming registers



Independent instructions can be scheduled in parallel

We can remove fake dependence by renaming registers

• However, ...



Independent instructions can be scheduled in parallel

We can remove fake dependence by renaming registers

However, # registers is limited! (Recall register allocation)



- Which should we do first?
- Register Allocation: Designed to use fewer registers
- Instruction Scheduling: Prefer more registers





- Which should we do first?
- Register Allocation: Designed to use fewer registers
- Instruction Scheduling: Prefer more registers



- It's a tough question! It depends on applications
- Multi-objective optimization <a href="https://en.wikipedia.org/wiki/Multi-objective\_optimization">https://en.wikipedia.org/wiki/Multi-objective\_optimization</a>



# PART II: Basic Block Scheduling



Node: Instructions

Edge: True Dependence (What is a true dependence?)



Node: Instructions

• Edge: True Dependence (Read-After-Write Dependence)



| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |





| LD  | R1, | a   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R2, | С   |    |
| MUL | R1, | R1, | R2 |
| ST  | а,  | R1  |    |



Recap: Fake dependence. How can we eliminate it?



| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | а,  | R1  |    |





| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | а,  | R1  |    |







| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | а,  | R1  |    |







| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |







| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |





| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |





| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |





| LD  | R1, | а   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | a,  | R1  |    |





| LD  | R1, | a   |    |
|-----|-----|-----|----|
| ADD | R1, | R1, | R1 |
| LD  | R2, | b   |    |
| MUL | R1, | R1, | R2 |
| LD  | R3, | С   |    |
| MUL | R1, | R1, | R3 |
| ST  | а,  | R1  |    |





- Check each clock cycle
- Schedule instructions when they are ready



- Check each clock cycle
- Schedule instructions when they are ready
- When multi instructions can be scheduled, check their priority



- Check each clock cycle
- Schedule instructions when they are ready
- When multi instructions can be scheduled, check their priority
  - The longest latency path (max depth)
  - Using the most resources

• . . .





























| R1, | a |  |
|-----|---|--|
| R2, | b |  |
|     |   |  |
|     |   |  |
|     |   |  |
|     |   |  |
|     |   |  |
|     |   |  |
|     |   |  |
|     |   |  |





| LD | R1, | a |  |
|----|-----|---|--|
| LD | R2, | b |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |





| LD | R1, | a |  |
|----|-----|---|--|
| LD | R2, | b |  |
| LD | R3, | С |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |





| LD | R1, | a |  |
|----|-----|---|--|
| LD | R2, | b |  |
| LD | R3, | С |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |





| LD | R1, | а |  |
|----|-----|---|--|
| LD | R2, | b |  |
| LD | R3, | С |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |





| LD | R1, | а |  |
|----|-----|---|--|
| LD | R2, | b |  |
| LD | R3, | С |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |
|    |     |   |  |





| LD  | R1, | a   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
|     |     |     |    |
|     |     |     |    |
|     |     |     |    |
|     |     |     |    |
|     |     |     |    |





| LD  | R1, | а   |    |
|-----|-----|-----|----|
| LD  | R2, | b   |    |
| LD  | R3, | С   |    |
| ADD | R1, | R1, | R1 |
|     |     |     |    |
|     |     |     |    |
|     |     |     |    |
|     |     |     |    |
|     |     |     |    |





| LD  | R1, | а   |    |  |
|-----|-----|-----|----|--|
| LD  | R2, | b   |    |  |
| LD  | R3, | С   |    |  |
| ADD | R1, | R1, | R1 |  |
| MUL | R1, | R1, | R2 |  |
|     |     |     |    |  |
|     |     |     |    |  |
|     |     |     |    |  |
|     |     |     |    |  |





| LD  | R1,   | а     |      |
|-----|-------|-------|------|
| LD  | R2,   | b     |      |
| LD  | R3,   | С     |      |
| ADD | R1,   | R1,   | R1   |
| MUL | R1,   | R1,   | R2   |
| NOP | (no o | perat | ion) |
|     |       |       |      |
|     |       |       |      |
|     |       |       |      |





| LD  | R1,   | а     |      |
|-----|-------|-------|------|
| LD  | R2,   | b     |      |
| LD  | R3,   | С     |      |
| ADD | R1,   | R1,   | R1   |
| MUL | R1,   | R1,   | R2   |
| NOP | (no o | perat | ion) |
|     |       |       |      |
|     |       |       |      |
|     |       |       |      |





| LD  | R1,   | а     |      |
|-----|-------|-------|------|
| LD  | R2,   | b     |      |
| LD  | R3,   | С     |      |
| ADD | R1,   | R1,   | R1   |
| MUL | R1,   | R1,   | R2   |
| NOP | (no o | perat | ion) |
|     |       |       |      |
|     |       |       |      |
|     |       |       |      |





• Cycle = 7, 8

| LD  | R1,   | а     |      |  |
|-----|-------|-------|------|--|
| LD  | R2,   | b     |      |  |
| LD  | R3,   | С     |      |  |
| ADD | R1,   | R1,   | R1   |  |
| MUL | R1,   | R1,   | R2   |  |
| NOP | (no o | perat | ion) |  |
| MUL | R1,   | R1,   | R3   |  |
| NOP | (no o | perat | ion) |  |
|     |       |       |      |  |





• Cycle = 9, 10, 11

| LD  | R1,   | a     |      |  |
|-----|-------|-------|------|--|
| LD  | R2,   | b     |      |  |
| LD  | R3,   | С     |      |  |
| ADD | R1,   | R1,   | R1   |  |
| MUL | R1,   | R1,   | R2   |  |
| NOP | (no o | perat | ion) |  |
| MUL | R1,   | R1,   | R3   |  |
| NOP | (no o | perat | ion) |  |
| ST  | a,    | R1    |      |  |

ST a, R1







| LD  | R1, | a   |    |  |
|-----|-----|-----|----|--|
| LD  | R2, | b   |    |  |
| LD  | R3, | С   |    |  |
| ADD | R1, | R1, | R1 |  |
| MUL | R1, | R1, | R2 |  |
| MUL | R1, | R1, | R3 |  |
| ST  | a,  | R1  |    |  |



#### **Summary of List Scheduling**

- Check each clock cycle
- Schedule instructions when they are ready
- When multiple instructions can be scheduled, check their priority
  - The longest latency path (max depth)
  - Using the most resources

• ...



# PART III: Global Code Scheduling



#### **Recap: Dominance**

- A dominates B if and only if ...?
- A strictly dominates B if and only if ...?



#### **Recap: Dominance**

- A dominates B if and only if ...?
- A strictly dominates B if and only if ...?

- A post-dominates B if and only if ...?
- A strictly post-dominates B if and only if ...?



- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.





- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.







- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.





- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.





- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.

- Three EBBs in the example
- Four EBB paths





- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.

- Three EBBs in the example
- Four EBB paths





**Extended Basic Block (EBB)** 

- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.

- Three EBBs in the example
- Four EBB paths





# **Extended Basic Block (EBB)**

- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.

- Three EBBs in the example
- Four EBB paths





**Extended Basic Block (EBB)** 

- EBB is a maximal set of blocks such that
- (1) It has a single entry;
- (2) All blocks except for the entry has only one predecessor.

- Three EBBs in the example
- Four EBB paths





 Schedule each EBB path like a basic block, e.g., via list scheduling





 Schedule each EBB path like a basic block, e.g., via list scheduling

 Scheduling may change the order of instructions





 Schedule each EBB path like a basic block, e.g., via list scheduling

- Scheduling may change the order of instructions
  - Downward code motion
  - Upward code motion





 Schedule each EBB path like a basic block, e.g., via list scheduling

 Scheduling may change the order of instructions

 Extra code may need to be inserted





• I. src does not dominate dst





• I. src does not dominate dst





• I. src does not dominate dst





• II. *dst* does not post-dominate *src* 





• II. *dst* does not post-dominate *src* 





• II. *dst* does not post-dominate *src* 





• III. src dom dst & dst post-dom src





 Schedule an EBB path like a basic block, e.g., via list scheduling.

 Scheduling may change the order of instructions

 Extra code may need to be inserted





• I. dst does not dominate src





• I. dst does not dominate src





• I. dst does not dominate src





• II. src does not post-dominate dst





• II. src does not post-dominate dst





• II. src does not post-dominate dst





• III. dst dom src & src post-dom dst





 After scheduling an EBB path, remove it from the graph

Then schedule the next





 After scheduling an EBB path, remove it from the graph

 Then schedule the next, until all EBB paths are scheduled





Join points





- Join points
- Context Problem:
  - From B2 to B5, Undo h
  - From B3 to B5, Undo h changes the code semantics





- Join points
- Context Problem:
  - From B2 to B5, Undo h
  - From B3 to B5, Undo h
     changes the code semantics
- Clone blocks to address the context problem





- Join points
- Context Problem:
  - From B2 to B5, Undo h
  - From B3 to B5, Undo h
     changes the code semantics
- Clone blocks to address the context problem





- Join points
- Context Problem:
  - From B2 to B5, Undo h
  - From B3 to B5, Undo h changes the code semantics
- Clone blocks to address the context problem





- Join points
- Context Problem:
  - From B2 to B5, Undo h
  - From B3 to B5, Undo h changes the code semantics
- Clone blocks to address the context problem





# **Trace Scheduling**



## **Trace Scheduling**

Profile and schedule the hot path





# **Trace Scheduling**

Profile and schedule the hot path

- Profiling-guided optimization
- Just-in-time compiler (that in JVM)

What is JIT?





# **PART IV: Software Pipelining**



### **How about Loops?**

 We can schedule the loop body as a single block



## **How about Loops?**

 We can schedule the loop body as a single block





### **How about Loops?**

- We can schedule the loop body as a single block
- Miss cross-iteration parallelism





```
for (i = 0; .....; i += 4)

D[i] = A[i] * B[i] + c

D[i+1] = A[i+1] * B[i+1] + c

D[i+2] = A[i+2] * B[i+2] + c

D[i+3] = A[i+3] * B[i+3] + c
```



```
for (i = 0; .....; i += 4)
D[i] = A[i] * B[i] + c
D[i+1] = A[i+1] * B[i+1] + c
D[i+2] = A[i+2] * B[i+2] + c
D[i+3] = A[i+3] * B[i+3] + c
```





 We still schedule the loop body as a single block

```
for (i = 0; .....; i += 4)

D[i] = A[i] * B[i] + c

D[i+1] = A[i+1] * B[i+1] + c

D[i+2] = A[i+2] * B[i+2] + c

D[i+3] = A[i+3] * B[i+3] + c
```





- We still schedule the loop body as a single block
- Let us schedule instructions across multiple loop iterations

```
for (i = 0; .....; i += 4)

D[i] = A[i] * B[i] + c

D[i+1] = A[i+1] * B[i+1] + c

D[i+2] = A[i+2] * B[i+2] + c

D[i+3] = A[i+3] * B[i+3] + c
```





# **Problems of Loop Unrolling**

- We cannot unroll too many times, which
- I. makes code size larger
  - cache pressure
  - register pressure
  - (slows down the code)
- II. and does not break the boundary of loop iteration

```
for (i = 0; .....; i += 4)
D[i] = A[i] * B[i] + c
D[i+1] = A[i+1] * B[i+1] + c
D[i+2] = A[i+2] * B[i+2] + c
D[i+3] = A[i+3] * B[i+3] + c
```





# **Software Pipelining**

- Break the boundary of loop iterations
- Regard a loop as a whole, not separate blocks
- More chances of parallelism without bloating the code



# **Software Pipelining**

- Break the boundary of loop iterations
- Regard a loop as a whole, not separate blocks
- More chances of parallelism without bloating the code

Refer to Chapter 10, the Dragon book



# **Software Pipelining**

- Break the boundary of loop iterations
- Regard a loop as a whole, not separate blocks
- More chances of parallelism without bloating the code

- Refer to Chapter 10, the Dragon book
- No Silver Bullet!



#### Summary





#### Summary



Can we generate code to leverage modern CPUs for better instruction-level parallelization? Yes! Possible!



#### Summary



Can we generate code to leverage modern CPUs for better instruction-level parallelization? Yes! Possible!

- Scheduling a basic block
- Scheduling extended basic blocks
- Scheduling loops (optional)



#### **THANKS!**